{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Formatting and deduping data\n", "\n", "Formatting columns and removing duplicates is an important part of data preparation.\n", "\n", "Preparing data for analysis is a crucial step in any data science project. One aspect of data preparation is formatting columns and removing duplicates. Inaccurate or inconsistent formatting of columns can make it difficult to analyze data or even result in incorrect results. Similarly, duplicate data can skew analysis and lead to inaccurate conclusions. \n", "\n", "This notebook will explore how to format columns in Pandas dataframes to ensure data accuracy and consistency. We will also discuss detecting and removing duplicate data and handling missing values in columns. These techniques ensure data is adequately prepared for analysis and modelling, leading to more accurate and reliable results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/housing.csv\", dtype={\"housing_median_age\": int,\"ocean_proximity\": \"category\"})" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "longitude float64\n", "latitude float64\n", "housing_median_age int32\n", "total_rooms float64\n", "total_bedrooms float64\n", "population float64\n", "households float64\n", "median_income float64\n", "median_house_value float64\n", "ocean_proximity category\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
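The preview above shows count-like columns such as `total_rooms` and `population` stored as floats even though they hold whole numbers. A minimal sketch of one way to cast such columns follows. It uses a toy frame rather than `data/housing.csv`: the first two rows copy the preview, while the third row and its missing `total_bedrooms` value are made up for illustration. Choosing the nullable `Int64` dtype for a column with gaps is an assumption of this sketch, not necessarily what the notebook does.

```python
import pandas as pd

# Toy frame: rows 0 and 1 copy the preview above; row 2 is hypothetical
# and carries a missing total_bedrooms value for illustration only.
sample = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1500.0],
        "total_bedrooms": [129.0, 1106.0, float("nan")],
        "population": [322.0, 2401.0, 600.0],
    }
)

# Columns without missing values can be cast to a plain integer dtype.
# A column that contains NaN cannot be cast to int64; the nullable
# "Int64" extension dtype keeps integers alongside missing values.
sample = sample.astype(
    {
        "total_rooms": "int64",
        "population": "int64",
        "total_bedrooms": "Int64",
    }
)

print(sample.dtypes)
```

Leaving a column with missing values as `float64` is an equally valid choice; the nullable dtype is only useful when integer semantics matter downstream.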
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25 | 1665 | 374.0 | 845 | 330 | 1.5603 | 78100 | INLAND |
| 20636 | -121.21 | 39.49 | 18 | 697 | 150.0 | 356 | 114 | 2.5568 | 77100 | INLAND |
| 20637 | -121.22 | 39.43 | 17 | 2254 | 485.0 | 1007 | 433 | 1.7000 | 92300 | INLAND |
| 20638 | -121.32 | 39.43 | 18 | 1860 | 409.0 | 741 | 349 | 1.8672 | 84700 | INLAND |
| 20639 | -121.24 | 39.37 | 16 | 2785 | 616.0 | 1387 | 530 | 2.3886 | 89400 | INLAND |

20640 rows × 10 columns
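The introduction also names detecting and removing duplicate rows as part of data preparation. The sketch below shows the standard pandas calls for that on a small made-up frame. The rows are hypothetical and the third one deliberately repeats the first, so nothing here reflects how many duplicates, if any, the housing data contains.

```python
import pandas as pd

# Made-up rows for illustration; the third row repeats the first.
toy = pd.DataFrame(
    {
        "longitude": [-122.23, -122.22, -122.23],
        "latitude": [37.88, 37.86, 37.88],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY"],
    }
)

# duplicated() marks each row that repeats an earlier row as True.
duplicate_mask = toy.duplicated()
print(duplicate_mask.sum())

# drop_duplicates() keeps the first occurrence of each row by default;
# resetting the index closes the gap left by the dropped row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```

Passing `subset=[...]` to either call restricts the comparison to chosen columns, which may be preferable when only some columns identify a record.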